The R console looks like this:
Make sure that you set up a folder for this class.
You can knit the file. The first time you do this you will need to make sure you have the knitr package installed. You have the option to knit to .html, .pdf, or Word (.docx). In general, in this course we will be knitting to .html.
To make something “code-looking”, wrap it in grave accents (backticks, `), found in the upper left of your keyboard.
To create a header, place a hash sign (#) at the start of the line. For example, # Header 1 creates a level 1 header and ## Header Level 2 creates a level 2 header.
To italicize text, put an asterisk on each side of the text *like this*. To make text bold, put two asterisks on each side of the text **like this**.
To make a list, just start creating your list using a - or * for each bullet, like this:
- list item 1
- list item 2
It is important that there is a blank line before the first bullet.
Add a link with the following code:
[Link text that will display](https://www.google.com)
It will display like this:
Add an image with the following code (replace the placeholder path with the location of your own image):
![Alt text](path/to/image.png)
It will display like this:
Alt text
Most markdown syntax is covered in the RStudio RMarkdown Cheatsheet, Section 3.
Create an R chunk:
2+2
## [1] 4
OR
x<-4
echo=T or echo=F: determines whether or not to echo the source code in the output file. Setting echo=F can be useful if you are creating a document for someone who doesn’t need or want to see your code, just the output. In general, for assignments in this course I would like your code to be echoed. The default is echo=T.
results=T or results=F: determines whether or not the results will be displayed. Setting results=F can be useful if you want to show code, but don’t care what the output is. By default, results are displayed (results='markup').
eval=T or eval=F: determines whether or not to evaluate the code. Setting eval=F can be useful if you have a whole chunk of code you don’t want run, but you also don’t want to delete. The default is eval=T.
There are many, many more options including fig.width, fig.height, cache, etc. The vast majority of options are available in the RStudio RMarkdown Cheatsheet, Section 5.
You can set the options individually on each chunk and/or set global options by using the code knitr::opts_chunk$set(your options here) in the first code chunk.
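As a minimal sketch of what that first chunk might contain, assuming the knitr package is installed: the option values below are illustrative choices, not course requirements.

```r
# Assumes the knitr package is installed.
# Set defaults for every chunk that follows; an individual chunk can still
# override any of these in its own header.
knitr::opts_chunk$set(
  echo = TRUE,     # show the source code in the output
  eval = TRUE,     # run the code
  fig.width = 6,   # figure width in inches (illustrative value)
  fig.height = 4   # figure height in inches (illustrative value)
)

# Check what is currently set for one option
knitr::opts_chunk$get("echo")
```

Writing TRUE and FALSE in full is a little safer than T and F, because T and F are ordinary variables that can be reassigned.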
Rather than using a code chunk (which is set apart from the surrounding text), you also have the option to use inline code. You can place the following within any sentence or paragraph.
`r codehere`
For example,
This is the number `r x`.
becomes… This is the number 4.
Packages can contain lots of things including: data sets, functions, etc.
You can install packages using the packages tab or you can use the code install.packages('packageyouwant') in the console.
In each new R session where you want to use the package you will have to load it by typing library('packageyouwant') in the console (or in the RMarkdown document - more later).
To get help with a package (or a function in a package) you can type ?packagename into the console.
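The install-then-load pattern can be sketched as follows. Here "stats", a package that ships with R, stands in for 'packageyouwant' so the example runs without downloading anything; the variable name pkg is just for illustration.

```r
# "stats" ships with R and stands in here for 'packageyouwant'
pkg <- "stats"

# Install only if the package is not already available on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load it for the current session.
# character.only = TRUE lets library() accept the name stored in a variable.
library(pkg, character.only = TRUE)

# Confirm the package is now attached
"package:stats" %in% search()
```

Installing happens once per computer; loading with library() happens once per R session.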
Assigning Variables:
x<-56
Calculations:
y <- x*2 #multiply
# note that because value is assigned to y, it won't print out
y #prints out the value of y
## [1] 112
x/2 #divide
## [1] 28
x^2 #x to the power of 2
## [1] 3136
Vectors:
# c() function: concatenate
heights <- c(67, 100, 34, 78, 80)
Referencing Elements of a Vector:
heights[3]
## [1] 34
Adding to Vectors:
heights <- c(heights, 90)
heights
## [1] 67 100 34 78 80 90
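A few more ways to work with the same vector, as a short sketch. The vector is rebuilt here so the example runs on its own.

```r
# Same values as the heights vector above, including the added 90
heights <- c(67, 100, 34, 78, 80, 90)

length(heights)         # number of elements: 6
heights[c(1, 3)]        # first and third elements: 67 34
heights[-1]             # everything except the first element
heights[heights > 75]   # only the elements greater than 75: 100 78 80 90
heights / 2             # arithmetic is applied to every element
```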
From a file on your computer:
airbnb <- read.csv("NYCairbnb2019.csv")
From a package:
library("openintro")
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
##
## cars, trees
cars
## type price mpgCity driveTrain passengers weight
## 1 small 15.9 25 front 5 2705
## 2 midsize 33.9 18 front 5 3560
## 3 midsize 37.7 19 front 6 3405
## 4 midsize 30.0 22 rear 4 3640
## 5 midsize 15.7 22 front 6 2880
## 6 large 20.8 19 front 6 3470
## 7 large 23.7 16 rear 6 4105
## 8 midsize 26.3 19 front 5 3495
## 9 large 34.7 16 front 6 3620
## 10 midsize 40.1 16 front 5 3935
## 11 midsize 15.9 21 front 6 3195
## 12 large 18.8 17 rear 6 3910
## 13 large 18.4 20 front 6 3515
## 14 large 29.5 20 front 6 3570
## 15 small 9.2 29 front 5 2270
## 16 small 11.3 23 front 5 2670
## 17 midsize 15.6 21 front 6 3080
## 18 small 12.2 29 front 5 2295
## 19 large 19.3 20 front 6 3490
## 20 small 7.4 31 front 4 1845
## 21 small 10.1 23 front 5 2530
## 22 midsize 20.2 21 front 5 3325
## 23 large 20.9 18 rear 6 3950
## 24 small 8.4 46 front 4 1695
## 25 small 12.1 42 front 4 2350
## 26 small 8.0 29 front 5 2345
## 27 small 10.0 22 front 5 2620
## 28 midsize 13.9 20 front 5 2885
## 29 midsize 47.9 17 rear 5 4000
## 30 midsize 28.0 18 front 5 3510
## 31 midsize 35.2 18 rear 4 3515
## 32 midsize 34.3 17 front 6 3695
## 33 large 36.1 18 rear 6 4055
## 34 small 8.3 29 front 4 2325
## 35 small 11.6 28 front 5 2440
## 36 midsize 61.9 19 rear 5 3525
## 37 midsize 14.9 19 rear 5 3610
## 38 small 10.3 29 front 5 2295
## 39 midsize 26.1 18 front 5 3730
## 40 small 11.8 29 front 5 2545
## 41 midsize 21.5 21 front 5 3200
## 42 midsize 16.3 23 front 5 2890
## 43 large 20.7 19 front 6 3470
## 44 small 9.0 31 front 4 2350
## 45 midsize 18.5 19 front 5 3450
## 46 large 24.4 19 front 6 3495
## 47 small 11.1 28 front 5 2495
## 48 small 8.4 33 4WD 4 2045
## 49 small 10.9 25 4WD 5 2490
## 50 small 8.6 39 front 4 1965
## 51 small 9.8 32 front 5 2055
## 52 midsize 18.2 22 front 5 3030
## 53 small 9.1 25 front 4 2240
## 54 midsize 26.7 20 front 5 3245
For now, we will mostly be working with .csv and .xls files. Later in the course, we may discuss other types of files.
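To try read.csv() without needing a particular file, the sketch below writes a tiny made-up data set to a temporary .csv file and reads it back. The object names (example, path, from_file) and the values are hypothetical.

```r
# Write a tiny made-up data set to a temporary .csv file
path <- tempfile(fileext = ".csv")
example <- data.frame(name = c("a", "b", "c"), price = c(100, 250, 80))
write.csv(example, path, row.names = FALSE)

# Read it back in; stringsAsFactors = FALSE keeps text columns as character
from_file <- read.csv(path, stringsAsFactors = FALSE)

head(from_file)   # first few rows
str(from_file)    # column names and types
```

read.csv() also accepts a web address in place of a file path, so the same call can read a .csv file hosted online.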
Assessing Size:
# dim() spits out dimensions of a dataframe
dim(airbnb)
## [1] 48895 16
Names:
# names() spits out column names of a dataframe
names(airbnb)
Referencing Columns:
airbnb$latitude
airbnb[,3]
airbnb[,"latitude"]
attach(airbnb)
latitude
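The sketch below checks that these approaches return the same column, using a small made-up data frame (listings) in place of the airbnb data. It also shows with(), which is one alternative to attach(): it makes the columns available for a single expression instead of for the rest of the session.

```r
# Small made-up stand-in for the airbnb data
listings <- data.frame(
  id = 1:3,
  price = c(120, 85, 300),
  latitude = c(40.7, 40.6, 40.8)
)

by_dollar   <- listings$latitude          # by name, using $
by_position <- listings[, 3]              # by column position
by_name     <- listings[, "latitude"]     # by name, using brackets
by_with     <- with(listings, latitude)   # columns visible for one expression only

# All four give the same vector
identical(by_dollar, by_position)
identical(by_dollar, by_name)
identical(by_dollar, by_with)
```

Referencing by name is usually safer than by position, since the position changes if columns are added or reordered.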
Calculations:
mean(airbnb$price)
## [1] 152.7207
median(airbnb$price)
## [1] 106
sd(airbnb$price) #standard deviation
## [1] 240.1542
# calculates the mean price, broken down by neighbourhood group
tapply(airbnb$price, airbnb$neighbourhood_group, mean)
## Bronx Brooklyn Manhattan Queens Staten Island
## 87.49679 124.38321 196.87581 99.51765 114.81233
#calculates the mean price, broken down by room type
tapply(airbnb$price, airbnb$room_type, mean)
## Entire home/apt Private room Shared room
## 211.79425 89.78097 70.12759
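To see what tapply() is doing, here is a sketch on a made-up data frame small enough to check by hand. The aggregate() call is an alternative from base R that returns the same group means as a data frame.

```r
# Made-up data: five listings in three room types
listings <- data.frame(
  price = c(100, 200, 50, 150, 70),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room",
                "Private room", "Shared room")
)

# tapply(values, groups, function): apply the function within each group
means <- tapply(listings$price, listings$room_type, mean)
means   # Entire home/apt 150, Private room 100, Shared room 70

# Same calculation, returned as a data frame
aggregate(price ~ room_type, data = listings, FUN = mean)
```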
Conditional Subsetting:
# prints out all the rows where the price is at least 8000
airbnb[airbnb$price >= 8000,]
## id name host_id
## 4378 2953058 Film Location 1177497
## 6531 4737930 Spanish Harlem Apt 1235070
## 9152 7003697 Furnished room in Astoria apartment 20582832
## 12343 9528920 Quiet, Clean, Lit @ LES & Chinatown 3906464
## 17693 13894339 Luxury 1 bedroom apt. -stunning Manhattan views 5143901
## 29239 22436899 1-BR Lincoln Center 72390391
## 30269 23377410 Beautiful/Spacious 1 bed luxury flat-TriBeCa/Soho 18128455
## 40434 31340283 2br - The Heart of NYC: Manhattans Lower East Side 4382127
## host_name neighbourhood_group neighbourhood latitude longitude
## 4378 Jessica Brooklyn Clinton Hill 40.69137 -73.96723
## 6531 Olson Manhattan East Harlem 40.79264 -73.93898
## 9152 Kathrine Queens Astoria 40.76810 -73.91651
## 12343 Amy Manhattan Lower East Side 40.71355 -73.98507
## 17693 Erin Brooklyn Greenpoint 40.73260 -73.95739
## 29239 Jelena Manhattan Upper West Side 40.77213 -73.98665
## 30269 Rum Manhattan Tribeca 40.72197 -74.00633
## 40434 Matt Manhattan Lower East Side 40.71980 -73.98566
## room_type price minimum_nights number_of_reviews last_review
## 4378 Entire home/apt 8000 1 1 2016-09-15
## 6531 Entire home/apt 9999 5 1 2015-01-02
## 9152 Private room 10000 100 2 2016-02-13
## 12343 Private room 9999 99 6 2016-01-01
## 17693 Entire home/apt 10000 5 5 2017-07-27
## 29239 Entire home/apt 10000 30 0
## 30269 Entire home/apt 8500 30 2 2018-09-18
## 40434 Entire home/apt 9999 30 0
## reviews_per_month calculated_host_listings_count availability_365
## 4378 0.03 11 365
## 6531 0.02 1 0
## 9152 0.04 1 0
## 12343 0.14 1 83
## 17693 0.16 1 0
## 29239 NA 1 83
## 30269 0.18 1 251
## 40434 NA 1 365
# prints out all the rows where the neighbourhood group is Manhattan
# note the double equal sign
airbnb[airbnb$neighbourhood_group=="Manhattan",]
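Conditions can also be combined. The sketch below uses a small made-up data frame (listings) rather than the airbnb data, and shows subset() as an alternative way to write the same filter.

```r
# Made-up data
listings <- data.frame(
  neighbourhood_group = c("Manhattan", "Brooklyn", "Manhattan", "Queens"),
  price = c(9000, 120, 150, 8500)
)

# & means "and"; | means "or"
expensive_manhattan <- listings[listings$price >= 8000 &
                                  listings$neighbourhood_group == "Manhattan", ]
expensive_manhattan

# Rows go before the comma, columns after it
cheap <- listings[listings$price < 200, c("neighbourhood_group", "price")]
cheap

# subset() expresses the same filter without repeating the data frame name
subset(listings, price >= 8000 & neighbourhood_group == "Manhattan")
```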
hist(airbnb$price)
plot(airbnb$reviews_per_month, airbnb$price)
plot(cars$mpgCity, cars$weight)
boxplot(cars$mpgCity ~ cars$type)
table(cars$type)
##
## large midsize small
## 11 22 21
barplot(table(cars$type))
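Comment your code (put a # before a line of comment).
When naming variables, observations, data frames, or files, make them:
Other naming considerations:
- Avoid names of existing functions (such as filter or mean)
- Record what each name means (for example, surface_temp = surface temperature measurement on Mars in degrees Celsius)
Some suggestions for best practices:
- Enter values consistently (purple vs. Purple vs. purple_)
- Code missing values consistently (NA, NaN, -9999, -); don’t leave cells blank
by @alisonhorst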
Missing data are usually recorded in a data set as NA, NaN, N/A, or -9999. When performing operations on numbers, most functions will return NA if the data include missing values.
mean(airbnb$reviews_per_month)
## [1] NA
# use argument na.rm to remove NAs
mean(airbnb$reviews_per_month, na.rm=T)
## [1] 1.373221
#OR
# use function na.omit() to return a vector without NAs, then take the mean
mean(na.omit(airbnb$reviews_per_month))
## [1] 1.373221
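R only treats NA (and NaN) as missing. A code such as -9999 is read as an ordinary number, so it has to be converted to NA first. A minimal sketch with made-up values:

```r
# Made-up values; -9999 is being used as a missing-value code
reviews <- c(0.5, -9999, 2.0, NA, 1.5)

# Misleading: -9999 is treated as a real number here
mean(reviews, na.rm = TRUE)

# Recode -9999 as NA (%in% returns FALSE for the existing NA, so it is safe to index with)
reviews[reviews %in% -9999] <- NA

is.na(reviews)        # which values are missing
sum(is.na(reviews))   # how many are missing: 2

# Now the mean uses only 0.5, 2.0 and 1.5
mean(reviews, na.rm = TRUE)
```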
Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and plotting.
#to check whether something is a factor
is.factor(cars$type)
## [1] TRUE
# to make something a factor
cars$type <- factor(cars$type)
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order.
#to see the levels
levels(cars$type)
## [1] "large" "midsize" "small"
#to see the number of levels
nlevels(cars$type)
## [1] 3
table(cars$type)
##
## large midsize small
## 11 22 21
#to change the order or to give order
cars$type <- factor(cars$type, levels = c("small", "midsize", "large"), ordered=T)
min(cars$type)
## [1] small
## Levels: small < midsize < large
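Giving a factor an order is what makes comparisons such as min() possible. A short sketch on a made-up factor (sizes), built the same way as cars$type above:

```r
# Made-up ordered factor with the same levels as cars$type
sizes <- factor(c("small", "large", "midsize", "small"),
                levels = c("small", "midsize", "large"), ordered = TRUE)

sizes < "large"     # comparisons follow the level order: TRUE FALSE TRUE TRUE
max(sizes)          # largest level present: large
as.integer(sizes)   # position of each value in the level order: 1 3 2 1
table(sizes)        # counts are listed in level order, not alphabetically
```

With an unordered factor, the comparison and max() would not be meaningful.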
Note that you may only want to plot two of the levels (say, small and midsize cars).
boxplot(cars$mpgCity~cars$type)
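One way to restrict a plot to two levels is to subset the rows and then drop the unused level with droplevels(). This is a sketch on a made-up data frame (cars_demo), not necessarily the approach intended in these notes.

```r
# Made-up stand-in for the cars data
cars_demo <- data.frame(
  type = factor(c("small", "small", "midsize", "midsize", "large", "large")),
  mpgCity = c(31, 29, 21, 19, 17, 16)
)

# Keep only the small and midsize rows
two_types <- cars_demo[cars_demo$type %in% c("small", "midsize"), ]

# The factor still remembers "large", which would show up as an empty group
levels(two_types$type)

# Drop levels that no longer appear in the data
two_types$type <- droplevels(two_types$type)
levels(two_types$type)

# The boxplot now shows two groups only
boxplot(mpgCity ~ type, data = two_types)
```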